Python Logo

Introduction to Python with Application in Bioinformatics¶

Nanjiang Shu¶

2024-07-15 (Day 1)¶

Who we are¶

Nanjiang Shu Can Hou Yike Gong Lei Liu Yihan Hu
Drawing Drawing Drawing Drawing Drawing

Who you are¶

Introduce yourself!

Who you are - research experience¶

research fields

The above figure is generated by Python

Who you are - programming experience¶

research fields

The above figure is generated by Python

Who you are - bioinformatics experience¶

research fields

The above figure is generated by Python

Schedule¶

Drawing

Practical issues¶

  • Course website: https://python-bioinfo.bioshu.se/
  • Lectures with many exercises
  • Important concepts will be repeated and extended in the next few days
  • Schedule times are approximate
  • Ask TAs if you encounter any problems during the course
  • Discuss with your classmates when doing exercises

Check¶

  • Has everyone managed to install Python?
  • Have you managed to run the test script?
  • Have you installed Jupyter notebooks? (optional)

Why programming?¶

Typical workflow¶

  1. Get data
  2. Clean, transform data in spreadsheet
  3. Copy-paste, copy-paste, copy-paste
  4. Run analysis & export results
  5. Realise the columns were not sorted correctly
  6. Go back to step 2 and repeat

Drawing

With programming, you can automate some manual procedures

Why Python?¶

  • Readability and simplicity
In [ ]:
# In Python
print("Hi, Python!")
In [ ]:
/* In C++ */
#include <iostream>

int main() {
    std::cout << "Hi, Python!" << std::endl;
    return 0;
}

Why Python?¶

Popularity, which means extensive libraries and strong community support¶

Drawing

Why Python?¶

Integration and Extensibility¶

  • Integration with C.
    • NumPy, which builds the foundation for the popular deep learning package TensorFlow
  • pandas to read, manipulate and write Excel files programmatically

Python is under active development¶

Old versions                      Python 3
Python 1.0 - January 1994         Python 3.0 - December 3, 2008
Python 1.2 - April 10, 1995       Python 3.1 - June 27, 2009
Python 1.3 - October 12, 1995     Python 3.2 - February 20, 2011
Python 1.4 - October 25, 1996     Python 3.3 - September 29, 2012
Python 1.5 - December 31, 1997    Python 3.4 - March 16, 2014
Python 1.6 - September 5, 2000    Python 3.5 - September 13, 2015
Python 2.0 - October 16, 2000     Python 3.6 - December 23, 2016
Python 2.1 - April 17, 2001       Python 3.7 - June 27, 2018
Python 2.2 - December 21, 2001    Python 3.8 - October 14, 2019
Python 2.3 - July 29, 2003        Python 3.9 - October 5, 2020
Python 2.4 - November 30, 2004    Python 3.10 - October 4, 2021
Python 2.5 - September 19, 2006   Python 3.11 - October 24, 2022
Python 2.6 - October 1, 2008      Python 3.12 - October 2, 2023
Python 2.7 - July 3, 2010

Course content¶

  • Core concepts about Python syntax: Data types, blocks and indentation, variable scoping, iteration, functions, methods and arguments
  • Different ways to control program flow using loops and conditional tests
  • Writing functions and scripts and making them usable
  • Reading from and writing to files
  • Code packaging and Python libraries
  • How to work with biological data using external libraries.
  • How to use AI to assist your programming and to help with future study

Learning outcomes¶

At the end of the course, you should be able to:

  • Use variables and explain how operators work
  • Process data using loops
  • Separate data using if/else statements
  • Use functions to read and write files
  • Describe your own approach to a coding task
  • Understand the difference between functions and methods
  • Read the documentation for built-in functions and methods
  • Understand the concept and syntax of a function

Learning outcomes, cont.¶

  • Write basic functions for processing data
  • Describe pandas dataframes
  • Give examples of how to use pandas for processing data
  • Make plots with Python from CSV or Excel files
  • Combine basic concepts to create functional stand-alone programs to process data
  • Write file processing Python programs that produce output to the terminal and/or external files
  • Know how to further develop your skills in Python after the course

Day 1¶

  • Session 1:
    • Data Types, variables, and their basic operations
  • Session 2:
    • Built-in functions and operations
  • Session 3: (after lunch)
    • Loops
  • Session 4:
    • Conditional Statements: if/else

Session 1: Data types, variables, and their basic operations¶

In [92]:
print(1 + 1)
print("Hello Python")
2
Hello Python
In [93]:
a = 1 + 1
print(a)
a = "Hello Python"
print(a)
2
Hello Python
  • Fixed values in Python code, e.g. 1 and "Hello Python", are called literals
  • Literals are immutable
  • The name a that holds the value is called a variable

Types of Literals¶

  • Numeric Literals
    • Integer, e.g. 42, -5, 1000
    • Float, e.g. 3.14, -0.002, 10.0
  • String Literals
    • Represent sequences of characters, e.g. Hello Python, ATCG
  • Boolean Literals
    • Only two values: True and False.
  • None Literal
    • Represents the absence of a value or a null value. It has the value None

1. The literal's type determines the variable's type¶


2. In Python, the data type is inferred automatically¶

In [ ]:
a = 1
a = "ATCG"
a = True
a = None
In [ ]:
# Can you tell the types of them?
sequence_length = 200
scale = 2.5
gene_id = "ABC12345"
is_DNA = False

Use the type() function to determine the type of a variable¶
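As a minimal sketch, the variables from the earlier cell can be passed to type() one by one. The expected results in the comments assume the values shown above.

```python
# Variables from the earlier cell
sequence_length = 200
scale = 2.5
gene_id = "ABC12345"
is_DNA = False

# type() returns the class of the value a variable refers to
print(type(sequence_length))  # <class 'int'>
print(type(scale))            # <class 'float'>
print(type(gene_id))          # <class 'str'>
print(type(is_DNA))           # <class 'bool'>
print(type(None))             # <class 'NoneType'>
```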

In [ ]:
 

Use the print() function to display the value of a variable¶

In [ ]:
sequence_length = 200
print(sequence_length)

Collection data types¶

  • List and Tuple
  • Set
  • Dictionary

List and Tuple¶

  • Lists and tuples are ordered collections of elements
In [ ]:
seq_len = 200
seq_lens = [100, 150, 200] # a list 
print(seq_lens[1])
In [ ]:
seq_lens = (100, 150, 200) # a tuple
print(seq_lens[1])

Elements in a list or tuple can be homogeneous (all the same type) or heterogeneous (different types), and they can be of any data type.¶

In [ ]:
li = [100, 150, None, "ATCG", 3.1415, seq_len, seq_lens]
li

Difference between List and Tuple¶

  • List is mutable
  • Tuple is immutable
In [ ]:
li_seqlens = [100, 150, 200]
tu_seqlens = (100, 150, 200)
In [ ]:
li_seqlens[1] = 500
print(li_seqlens)
In [94]:
tu_seqlens[1] = 500
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[94], line 1
----> 1 tu_seqlens[1] = 500

TypeError: 'tuple' object does not support item assignment

Set¶

  • Set is an unordered collection of unique elements
In [96]:
gene_ids = {"TP53", "COX2", "EGFR", "MTOR"} # a set
gene_ids
Out[96]:
{'COX2', 'EGFR', 'MTOR', 'TP53'}
In [ ]:
# set is unordered
gene_ids = {"1", "2", "3", "4", "5"}
for e in gene_ids:
    print(e)
In [ ]:
# a set keeps only unique elements
gene_ids = {"1", "1", "2", "2", "3"}
print(gene_ids)
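A common use of this property is removing duplicates from a list. The sketch below uses made-up example data (gene_list is a hypothetical name) and sorts the result only to make the printed order predictable.

```python
# A list with repeated gene IDs (hypothetical example data)
gene_list = ["TP53", "EGFR", "TP53", "MTOR", "EGFR"]

# Converting to a set drops the duplicates
unique_genes = set(gene_list)

print(len(gene_list))          # 5
print(len(unique_genes))       # 3
print("TP53" in unique_genes)  # True
print(sorted(unique_genes))    # ['EGFR', 'MTOR', 'TP53']
```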

Dictionary¶

  • A dictionary is a mutable collection of key-value pairs. Since Python 3.7 it preserves insertion order, but items are accessed by key rather than by position.
  • Each key in a dictionary must be unique and immutable (e.g., strings, numbers, or tuples), while the values associated with keys can be of any data type and can be duplicated.
In [ ]:
sequence_info = {  # a dictionary
    "gene": "TP53",
    "species": "Homo sapiens",
    "length": 2000
}
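A short sketch of how such a dictionary might be used: values are looked up, updated, and added by key. The "chromosome" entry is an illustrative addition that is not part of the original example.

```python
sequence_info = {
    "gene": "TP53",
    "species": "Homo sapiens",
    "length": 2000
}

# Look up a value by its key
print(sequence_info["gene"])        # TP53

# Dictionaries are mutable: update an existing value, add a new pair
sequence_info["length"] = 2100
sequence_info["chromosome"] = "17"  # illustrative value

print(len(sequence_info))           # 4
print("species" in sequence_info)   # True (membership tests the keys)
```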

What operations can we do with different values?¶

That depends on their type

Integer operations¶

  • Addition: 2 + 3 = 5
  • Subtraction: 7 - 4 = 3
  • Multiplication: 6 * 3 = 18
  • Division: 8 / 2 = 4.0
  • Modulus (remainder): 5 % 2 = 1
  • Exponent: 2 ** 3 = 8
In [98]:
8/2
Out[98]:
4.0

Float operations¶

  • Addition: 1.5 + 2.3 = 3.8
  • Subtraction: 5.5 - 1.2 = 4.3
  • Multiplication: 3.2 * 2.1 = 6.72
  • Division: 7.5 / 2.5 = 3.0
In [99]:
type(1 + 1.5)
Out[99]:
float

Warning: keep in mind the precision limitations of floating-point arithmetic.¶

In [ ]:
result = 0.1 + 0.2 - 0.3
print(result)
In [ ]:
print(result == 0.0)
In [ ]:
print(abs(result - 0.0) < 1e-6)
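One possible alternative to a hand-written tolerance check is math.isclose() from the standard library. This is a sketch only; the tolerance 1e-9 is an arbitrary choice for illustration. Note that when comparing against zero, an absolute tolerance (abs_tol) has to be given, because the default relative tolerance scales with the size of the values.

```python
import math

result = 0.1 + 0.2 - 0.3

print(result == 0.0)                            # False
print(math.isclose(0.1 + 0.2, 0.3))             # True
print(math.isclose(result, 0.0))                # False: relative tolerance is useless near zero
print(math.isclose(result, 0.0, abs_tol=1e-9))  # True
```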

String operations¶

  • Concatenation: "Homo " + "Sapiens" = "Homo Sapiens"
  • Repeating: "GC" * 3 = "GCGCGC"
  • Slicing: "protein"[1:4] = "rot"
  • Length: len("ATCG") -> 4
  • Methods:
    • .upper(), .lower()
    • .replace("A", "T")
In [ ]:
"protein"[1:4]
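The string operations listed above can be combined on a sequence, as in the sketch below. The sequence itself is made up for illustration. String methods return new strings and leave the original untouched.

```python
dna = "atgcGATTACA"  # hypothetical sequence with mixed case

upper_dna = dna.upper()
print(upper_dna)                    # ATGCGATTACA
print(upper_dna.replace("T", "U"))  # AUGCGAUUACA
print(upper_dna[0:3])               # ATG
print(len(upper_dna))               # 11
print(dna)                          # atgcGATTACA (the original is unchanged)
```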

Boolean operations¶

  • Logical AND: True and False = False
  • Logical OR: True or False = True
  • Logical NOT: not True = False
  • Comparison:
    • Equal: 2 == 2 = True
    • Not Equal: 2 != 3 = True
    • Greater Than: 5 > 3 = True
    • Less Than: 5 < 3 = False
In [ ]:
"a" == "b"

Data type conversion¶

  • Convert Float to Integer: int(3.14) = 3
  • Convert Integer to String: str(5) = "5"
  • Convert String to Float: float("3.14") = 3.14
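A small sketch of these conversions in practice, for example when a number arrives as text. The variable names are illustrative only.

```python
length_text = "200"        # a number stored as a string
length = int(length_text)  # convert string to integer
print(length + 50)         # 250

print(int(3.99))           # 3 (int() truncates, it does not round)
print(round(3.99))         # 4
print(str(5) + " reads")   # 5 reads
print(float("3.14"))       # 3.14
```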
In [ ]:
 

When you run operations on mixed data types, Python converts the types implicitly¶

In [ ]:
1 + 1.5
In [100]:
# Guess what will be the result for this?
1 + True
Out[100]:
2

How to correctly name a variable¶

Drawing

Special characters are NOT allowed: + - * $ % ; : , ? ! { } ( ) < > “ ‘ | \ / @

Examples:¶

Valid              Invalid
Var_name           2save
_total             *important
aReallyLongName    Special%
with_digit_2       With   spaces

What about dkfsjdsklut?¶

  • Well, this is a valid name, but NOT recommended

Reserved keywords¶

Drawing

These words cannot be used as variable names

In [101]:
global = 5
  Cell In[101], line 1
    global = 5
           ^
SyntaxError: invalid syntax
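If you are unsure whether a name is allowed, one option is to check it programmatically. The sketch below uses the standard keyword module and the string method isidentifier(). Note that isidentifier() only checks the characters, so a reserved keyword still passes that test.

```python
import keyword

# Is the name a reserved keyword?
print(keyword.iskeyword("global"))   # True
print(keyword.iskeyword("gene_id"))  # False

# Does the name follow the character rules for identifiers?
print("_total".isidentifier())       # True
print("2save".isidentifier())        # False
print("Special%".isidentifier())     # False
print("global".isidentifier())       # True (keywords are not checked here)
```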

Best practices for naming variables in Python¶

  • Use descriptive names:

    • Choose names that describe the purpose of the variable.
    • Example: Use gene_id instead of gi.
  • Follow naming conventions for Python, e.g., snake_case:

    • Example: Use gene_name, sequence_length, sample_id.
  • Prefix Booleans with 'is', 'has', 'can', etc:

    • Example: is_high_quality, has_mutations.

Summary¶

  • Literals define values and can have different types (strings, integers, floats, booleans, etc.).
  • Variables are identified by a name and are used to store a value.
  • Values can be collected in lists, tuples, sets, and dictionaries.
  • The operation that can be performed on a certain value depends on its type.
  • Name your variables using descriptive words, and avoid special characters and reserved keywords.

Exercise time¶

Day 1, Exercise 1 (~30 min)¶

  • Link: https://python-bioinfo.bioshu.se/exercises.html

Short break after the exercise¶

Session 2: Built-In functions and operations¶

In [102]:
# display the value with print()
result = "ACCCG" * 5
print(result)
ACCCGACCCGACCCGACCCGACCCG
In [103]:
# show the type of value with type()
print(type(result))
<class 'str'>
In [104]:
# convert float value to string value with str()
str(2.5)
Out[104]:
'2.5'

Python standard library¶

https://docs.python.org/3.9/library/functions.html

Drawing

len()¶

In [ ]:
sequence = "ATGCTACGATaCG"
len(sequence)
In [ ]:
seq_lens = [100, 200, 300]
len(seq_lens)
In [108]:
# can you get the length of an integer
len(3)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[108], line 2
      1 # can you get the length of an integer
----> 2 len(3)

TypeError: object of type 'int' has no len()

sum()¶

In [117]:
read_counts = [1500, 2000, 1750, 2250, 1900, 2500]
print("total_reads:", sum(read_counts))
total_reads: 11900

min() and max()¶

In [118]:
expression_levels = [2.5, 3.6, 4.2, 5.0, 3.8, 3.8, 9.5, 100.1]
print("Max expression level:", max(expression_levels))
print("Min expression level:", min(expression_levels))
min(3)
Max expression level: 100.1
Min expression level: 2.5
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[118], line 4
      2 print("Max expression level:", max(expression_levels))
      3 print("Min expression level:", min(expression_levels))
----> 4 min(3)

TypeError: 'int' object is not iterable
In [ ]:
print("Average expression level: ", sum(expression_levels)/len(expression_levels))

sorted()¶

In [120]:
read_counts = [1500, 2000, 1750, 2250, 1900, 2500]
sorted_read_counts = sorted(read_counts)
print(sorted_read_counts)
[1500, 1750, 1900, 2000, 2250, 2500]

Drawing

  • With these built-in functions, you can easily manipulate data to achieve your purpose
  • Further reading: https://docs.python.org/3.9/library/functions.html

Comparison operators¶

Drawing

Comparison operators can be used on int, float, str, and bool values. The result is always a Boolean.

In [122]:
seq_len1 = 150
seq_len2 = 181

seq_len1 <= seq_len2
Out[122]:
True

Logical operators¶

Drawing

In [123]:
freq1 = 0.51
freq2 = 1.5

freq1 > 0.5 and freq2 > 0.5
Out[123]:
True

Membership operators¶

Drawing

In [ ]:
gene_ids = ["TP53", "COX2", "EGFR", "MTOR"] # a list

"TP53" in gene_ids

Order of precedence¶

There is an order of precedence for all operators:

Drawing

A word of caution when using operators¶

In [124]:
length = 500
species = "Mouse"
read_count = 100
# I want to evaluate the condition that length is larger than 300 or species is Mouse, 
# and read_count is larger than 200
# Expected value: False
length > 300 or species == "Mouse" and read_count > 200
Out[124]:
True
In [ ]:
(length > 300 or species == "Mouse") and read_count > 200
  • Always remember that and takes precedence over or in logical expressions.
  • Use parentheses () to make your intended grouping explicit and improve readability.
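To make the grouping visible, the sketch below evaluates the same condition three ways: without parentheses, with the grouping Python applies by default, and with the intended grouping. The variable names on the left-hand side are illustrative.

```python
length = 500
species = "Mouse"
read_count = 100

# Without parentheses, "and" is evaluated before "or"
without_parentheses = length > 300 or species == "Mouse" and read_count > 200

# This is how Python groups the expression above
default_grouping = length > 300 or (species == "Mouse" and read_count > 200)

# This is the grouping that was actually intended
intended_grouping = (length > 300 or species == "Mouse") and read_count > 200

print(without_parentheses)  # True
print(default_grouping)     # True
print(intended_grouping)    # False
```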

More on operations for Lists and Strings¶

Lists and strings are ORDERED collections of elements, where every element can be accessed through an index.

Drawing

In [ ]:
mylist = [1, 2, 3, 4, 5, 6, 7, 8, 9]
mylist[2]
In [ ]:
mylist[1:3]
In [ ]:
mylist[0:9:2] # [start, stop, step]
In [ ]:
mylist[3:] # from 4th position to the end
In [ ]:
mylist[:5] # from the beginning to the 5th position
In [ ]:
mylist[:] # the same as mylist[::], mylist[::1]

Use of the negative index¶

                li = [1,   2,   3,   4,   5]
positive indices      0    1    2    3    4
negative indices     -5   -4   -3   -2   -1

Equation: positive_index = len(li) + negative_index

In [ ]:
mylist[-1] # return the last element, equivalent to mylist[8]
In [ ]:
# What will be the result for this?
mylist[::-1]

When the step is negative, it changes the direction¶
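A brief sketch of negative indices and negative steps on the list used above. With a negative step, the start index must be larger than the stop index, and the stop index is still excluded.

```python
mylist = [1, 2, 3, 4, 5, 6, 7, 8, 9]

print(mylist[-1])               # 9
print(mylist[len(mylist) - 1])  # 9 (same element, using the positive index)
print(mylist[-3:])              # [7, 8, 9]
print(mylist[::-1])             # [9, 8, 7, 6, 5, 4, 3, 2, 1]
print(mylist[7:2:-2])           # [8, 6, 4] (indices 7, 5, 3)
```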

Similar operations for Strings¶

In [ ]:
mystr = "123456789"
print("mystr[2] \t= ", mystr[2] )
print("mystr[1:3] \t= ", mystr[1:3])
print("mystr[0:9:2] \t= ", mystr[0:9:2])
print("mystr[3:] \t= ", mystr[3:])
print("mystr[:5] \t= ", mystr[:5])
print("mystr[:] \t= ", mystr[:])
print("mystr[-1] \t= ", mystr[-1])
print("mystr[::-1] \t= ", mystr[::-1])

Mutable vs Immutable objects¶

Mutable objects can be altered after creation, while immutable objects can't.

Immutable objects    Mutable objects
int                  list
float                set
bool                 dict
str
tuple
In [ ]:
# list is mutable
mylist = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(mylist)
mylist[2] = 7
print(mylist)
In [ ]:
# string is immutable
mystr = "123456789"
mystr[2] = "7"

Are int, float and bool immutable?¶

In [ ]:
a = 5
print("a=", a)
a = 6
print("a=", a)
In [ ]:
# id(var) returns the identity of the object var refers to (in CPython, its memory address).
a = 5
print("memory address of a = ", id(a), ", value of a = ", a)
a = 6
print("memory address of a = ", id(a), ", value of a = ", a)
In [126]:
mylist = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print("memory address of mylist = ", id(mylist), ", value of mylist = ", mylist)
mylist[2] = 7
print("memory address of mylist = ", id(mylist), ", value of mylist = ", mylist)
memory address of mylist =  4469238464 , value of mylist =  [1, 2, 3, 4, 5, 6, 7, 8, 9]
memory address of mylist =  4469238464 , value of mylist =  [1, 2, 7, 4, 5, 6, 7, 8, 9]

Operations on mutable sequences¶

Drawing

In [131]:
mylist = [1, 2, 3, 4, 5, 6, 7, 8, 9]
mylist.append(15)
print(mylist)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 15]
In [132]:
mylist.remove(15)
print(mylist)
[1, 2, 3, 4, 5, 6, 7, 8, 9]
In [134]:
del mylist[2]
print(mylist)
[1, 2, 4, 5, 6, 7, 8, 9]

Summary¶

  • The Python standard library has many built-in functions that are regularly used
  • Operators are used to carry out computations on different values
  • Three types of operators that return a boolean: comparison, logical, and membership
  • Order of precedence is crucial - use parentheses () for clarity!
  • Mutable objects can be changed after creation while immutable objects cannot be changed

Day 1, Exercise 2 (~30 min)¶

  • Link: https://python-bioinfo.bioshu.se/exercises.html

Break after the exercise¶


Day 1, quiz¶

  • Link: https://python-bioinfo.bioshu.se/quiz.html

Lunch¶

Afternoon session¶


1. Loops¶

  • Exercise 3 of Day 1

Break¶

2. if/else statement¶

  • Exercise 4 of Day 1

Session 3: Loops in Python¶

In [ ]:
gene_ids = ["TP53", "COX2", "EGFR", "MTOR"]
# How do we print all IDs, one per line?
In [ ]:
print(gene_ids[0])
print(gene_ids[1])
print(gene_ids[2])
print(gene_ids[3])
In [ ]:
gene_ids = ["TP53", "COX2", "EGFR", "MTOR"]

for gene_id in gene_ids:
    print(gene_id)

Note the INDENTATION of the body of the for loop

Indentation is crucial in Python¶

  • Blocks of code are defined by their indentation level.
  • Typically, four spaces (or one tab) are used for each indentation level; consistency within a block is the key.
  • Don't mix tabs and spaces: Python 3 raises a TabError when they are mixed inconsistently.
In [ ]:
for gene_id in gene_ids:
    print(gene_id)
In [ ]:
for gene_id in gene_ids:
    print("==ID==")
    print(gene_id)
In [ ]:
for gene_id in gene_ids:
        print("==ID==")
        print(gene_id)
In [ ]:
# Inconsistent indentation: running this cell raises an IndentationError
for gene_id in gene_ids:
        print("==ID==")
    print(gene_id)

Types of loops¶

For loop¶

In [ ]:
gene_ids = ["TP53", "COX2", "EGFR", "MTOR"]

for gene_id in gene_ids:
    print(gene_id)

While loop¶

In [ ]:
gene_ids = ["TP53", "COX2", "EGFR", "MTOR"]

i = 0
while i < len(gene_ids):
    print(gene_ids[i])
    i += 1

print("== When loop ends, i =", i)

Types of loops¶

For loop

A control flow statement that repeats an operation over a known number of steps.

While loop

A control flow statement that executes code repeatedly for as long as a given Boolean condition is true.


Which one to use?

For loops are better for simple iterations over lists and other iterable objects

While loops are more flexible and can iterate an unspecified number of times

Examples of while loop¶

In [ ]:
gene_ids = ["TP53", "COX2", "EGFR", "MTOR"]

i = 0
while i < len(gene_ids) and not gene_ids[i].startswith("E"):
    print(gene_ids[i])
    i += 1

print("== When loop ends, i =", i)
In [ ]:
# Infinite loop: the condition is always True, so interrupt the kernel to stop it
while True:
    print("yes")

Note: there is one built-in function called range() which is especially useful for the for loop

The range() function¶

In [ ]:
for i in range(10):
    print(i)
In [ ]:
file_basename = "dnaseq"
for i in range(10):
    seqfile = file_basename + "_" + str(i) + ".fa"
    print("Analyzing " + seqfile)
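Besides a single stop value, range() also accepts a start and a step. The sketch below wraps each range in list() only to make the generated numbers visible; in a for loop the wrapping is not needed.

```python
print(list(range(5)))         # [0, 1, 2, 3, 4]
print(list(range(1, 6)))      # [1, 2, 3, 4, 5]
print(list(range(0, 10, 3)))  # [0, 3, 6, 9]
print(list(range(5, 0, -1)))  # [5, 4, 3, 2, 1]

# range(len(...)) gives the valid indices of a list
gene_ids = ["TP53", "COX2", "EGFR", "MTOR"]
for i in range(len(gene_ids)):
    print(i, gene_ids[i])
```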

Summary¶

  • Python has two types of loops: for loops and while loops.
  • Loops can be used with any iterable types and objects to perform repeated tasks.
  • for loops are better suited for iterating over lists and other iterable objects when the number of iterations is known or finite.
  • while loops offer more flexibility and can iterate an unspecified number of times, as they continue until a specified condition is no longer true.
  • The range() function is useful for programming with loops.

Day 1, Exercise 3 (~30 minutes)¶

  • Link: https://python-bioinfo.bioshu.se/exercises.html

10 min break after the exercise¶

Session 4: Conditional statement¶

  • Conditional statements allow decision-making in a program.
  • Python uses if, elif, and else for conditionals.
In [ ]:
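condition1 = False  # example values; replace with real conditions
condition2 = True
if condition1:
    # executed if condition1 is True
    print("condition1 is True")
elif condition2:
    # executed if condition1 is False and condition2 is True
    print("condition1 is False and condition2 is True")
else:
    # executed if both condition1 and condition2 are False
    print("both conditions are False")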

The if statement¶

  • The if statement is the fundamental control statement: it lets Python execute a block of code only when a condition is True.
In [ ]:
dna_sequence = "AGTCTCG"
if 'N' not in dna_sequence:
    print("Valid DNA sequence.")

The else statement¶

  • An else statement follows an if and defines what to do if the if condition is not met.
In [ ]:
expression_level = 35
if expression_level > 50:
    print("Gene is overexpressed.")
else:
    print("Gene is not overexpressed.")

The elif statement¶

  • The elif (short for "else if") allows chaining of conditional statements.
In [ ]:
expression_level = 35
if expression_level > 100:
    print("Gene is overexpressed.")
elif expression_level > 30 and expression_level <= 100:
    print("Gene is expressed.")
else:
    print("Gene is underexpressed.")
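Conditionals become more useful inside a loop. The sketch below applies the same thresholds to several made-up expression levels (expression_levels and labels are illustrative names). Because an elif branch is only tested when the preceding conditions were False, the check for "<= 100" can be left out.

```python
expression_levels = [12, 35, 100, 250]  # hypothetical example values
labels = []

for level in expression_levels:
    if level > 100:
        label = "overexpressed"
    elif level > 30:  # reached only if level <= 100
        label = "expressed"
    else:
        label = "underexpressed"
    labels.append(label)
    print(level, label)
```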

Nested conditional statements¶

  • Nested conditions occur when a conditional statement is placed inside another conditional statement.
  • It allows for more specific and granular control based on multiple criteria.
In [ ]:
# Use nested conditionals to categorize genetic variants based on multiple attributes.
genotype = "AG"
phenotype = "expressed"
if genotype == "AG":
    # Only check phenotype if genotype is "AG"
    if phenotype == "expressed":
        print("Variant " + genotype + " is active and expressed.")
    else:
        print("Variant " + genotype + " is active but not expressed.")
else:
    print("Variant " + genotype + " is a non-target variant.")

Summary¶

  • if, elif, and else are powerful tools for controlling program logic.
  • if/elif/else statements can be nested. However, it is advisable to avoid excessive nesting, as this can make the code difficult to read and maintain.
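One possible way to reduce nesting is to combine conditions with and. The sketch below rewrites the nested variant example as a flat if/elif/else chain; it is meant as an illustration, not the only valid approach, and the variable message is introduced here for convenience.

```python
genotype = "AG"
phenotype = "expressed"

# Flat version of the nested example: one level of indentation
if genotype == "AG" and phenotype == "expressed":
    message = "Variant " + genotype + " is active and expressed."
elif genotype == "AG":
    message = "Variant " + genotype + " is active but not expressed."
else:
    message = "Variant " + genotype + " is a non-target variant."

print(message)
```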

Day 1, Exercise 4 (~30min)¶

  • Link: https://python-bioinfo.bioshu.se/exercises.html

End of Day 1¶